The Potential of Synergistic Static, Dynamic and Speculative Loop Nest Optimizations for Automatic Parallelization
Research in automatic parallelization of loop-centric programs started with
static analysis, then broadened its arsenal to include dynamic
inspection-execution and speculative execution, the best results involving
hybrid static-dynamic schemes. Beyond the detection of parallelism in a
sequential program, scalable parallelization on many-core processors involves
hard and interesting parallelism adaptation and mapping challenges. These
challenges include tailoring data locality to the memory hierarchy, structuring
independent tasks hierarchically to exploit multiple levels of parallelism,
tuning the synchronization grain, balancing the execution load, decoupling the
execution into thread-level pipelines, and leveraging heterogeneous hardware
with specialized accelerators. The polyhedral framework makes it possible to
model, construct and apply very complex loop nest transformations that address
most of these parallelism adaptation and mapping challenges. But apart from
hardware-specific, back-end oriented transformations (if-conversion, trace
scheduling, value prediction), loop nest optimization has essentially ignored
dynamic and speculative techniques. Research in polyhedral compilation recently
reached a significant milestone towards the support of dynamic, data-dependent
control flow. This opens a large avenue for blending dynamic analyses and
speculative techniques with advanced loop nest optimizations. Selecting
real-world examples from SPEC benchmarks and numerical kernels, we make a case
for the design of synergistic static, dynamic and speculative loop
transformation techniques. We also sketch the embedding of dynamic information,
including speculative assumptions, in the heart of affine transformation search
spaces.
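To make the kind of loop nest transformation discussed above concrete, here is a minimal Python sketch of loop tiling applied to a matrix product. The kernel, the tile size and the tiling scheme are purely illustrative; a polyhedral compiler would derive such a transformation automatically from an affine schedule rather than by hand.

```python
def matmul_naive(A, B, n):
    # reference kernel: plain triple loop over an n-by-n product
    C = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for k in range(n):
                C[i][j] += A[i][k] * B[k][j]
    return C

def matmul_tiled(A, B, n, T):
    # tiled kernel: the same iteration space, traversed tile by tile
    # to improve data locality; results are identical up to rounding
    C = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, T):
        for jj in range(0, n, T):
            for kk in range(0, n, T):
                for i in range(ii, min(ii + T, n)):
                    for j in range(jj, min(jj + T, n)):
                        for k in range(kk, min(kk + T, n)):
                            C[i][j] += A[i][k] * B[k][j]
    return C
```

Tiling reorders iterations without changing the computed values, which is exactly the kind of legality guarantee the polyhedral dependence analysis provides.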
From micro-OPs to abstract resources: constructing a simpler CPU performance model through microbenchmarking
This paper describes PALMED, a tool that automatically builds a resource
mapping, a performance model for pipelined, super-scalar, out-of-order CPU
architectures. Resource mappings describe the execution of a program by
assigning instructions in the program to abstract resources. They can be used
to predict the throughput of basic blocks or as a machine model for the backend
of an optimizing compiler. PALMED does not require hardware performance
counters, and relies solely on runtime measurements to construct resource
mappings. This allows it to model not only execution port usage, but also other
limiting resources, such as the frontend or the reorder buffer. Also, thanks to
a dual representation of resource mappings, our algorithm for constructing
mappings scales to large instruction sets, like that of x86. We evaluate the
algorithmic contribution of the paper in two ways. First, we show that our
approach can reverse-engineer an accurate resource mapping from an idealized
performance model produced by an existing port mapping. Second, we
evaluate the pertinence of our dual representation, as opposed to the standard
port-mapping, for throughput modeling by extracting a representative set of
basic-blocks from the compiled binaries of the Spec CPU 2017 benchmarks and
comparing the throughput predicted by existing machine models to that produced
by PALMED.
Hybrid Iterative and Model-Driven Optimization in the Polyhedral Model
On modern architectures, a missed optimization can translate into performance degradations reaching orders of magnitude. More than ever, translating Moore's law into actual performance improvements depends on the effectiveness of the compiler. Moreover, missing an optimization and putting the blame on the programmer is not a viable strategy: we must strive for portability of performance or the majority of the software industry will see no benefit in future many-core processors. As a consequence, an optimizing compiler must also be a parallelizing one; it must take care of the memory hierarchy and of (re)partitioning computation to best suit the target architecture. Polyhedral compilation is a program optimization and parallelization framework capable of expressing extremely complex transformation sequences. The ability to build and traverse a tractable search space of such transformations remains challenging, and existing model-based heuristics can easily be beaten in identifying profitable parallelism/locality trade-offs. We propose a hybrid iterative and model-driven algorithm for automatic tiling, fusion, distribution and parallelization of programs in the polyhedral model. Our experiments demonstrate the effectiveness of this approach, both in obtaining solid performance improvements over existing auto-parallelizing compilers, and in achieving portability of performance on various modern multi-core architectures.
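The hybrid scheme can be caricatured in a few lines of Python: an analytical model first prunes the transformation space (here, tile sizes checked against a toy cache capacity), and the surviving candidates are then ranked empirically by timing. Everything below, the cache size, the candidate tiles, the timed kernel, is an invented illustration of the general idea, not the paper's actual algorithm.

```python
import time

def prune_by_model(candidates=(8, 16, 32, 64, 128),
                   cache_bytes=32 * 1024, elem_bytes=8):
    # model-driven step: keep only tile sizes whose three working
    # tiles (two operands plus the result) fit a hypothetical cache
    return [t for t in candidates if 3 * t * t * elem_bytes <= cache_bytes]

def pick_tile_iteratively(kernel, n):
    # iterative step: actually run each surviving candidate on the
    # target machine and keep the fastest one
    best, best_time = None, float("inf")
    for t in prune_by_model():
        start = time.perf_counter()
        kernel(n, t)
        elapsed = time.perf_counter() - start
        if elapsed < best_time:
            best, best_time = t, elapsed
    return best
```

The model never has to be exact: it only needs to discard hopeless candidates cheaply, leaving the expensive empirical search a small, promising space to explore.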
PALMED: Throughput Characterization for Superscalar Architectures
In a super-scalar architecture, the scheduler dynamically assigns micro-operations (µOPs) to execution ports. The port mapping of an architecture describes how an instruction decomposes into µOPs and lists for each µOP the set of ports it can be mapped to. It is used by compilers and performance debugging tools to characterize the performance throughput of a sequence of instructions repeatedly executed as the core component of a loop. This paper introduces a dual equivalent representation: The resource mapping of an architecture is an abstract model where, to be executed, an instruction must use a set of abstract resources, themselves representing combinations of execution ports. For a given architecture, finding a port mapping is an important but difficult problem. Building a resource mapping is a more tractable problem and provides a simpler and equivalent model. This paper describes Palmed, a tool that automatically builds a resource mapping for pipelined, super-scalar, out-of-order CPU architectures. Palmed does not require hardware performance counters, and relies solely on runtime measurements. We evaluate the pertinence of our dual representation for throughput modeling by extracting a representative set of basic blocks from the compiled binaries of the SPEC CPU 2017 benchmarks. We compared the throughput predicted by existing machine models to that produced by Palmed, and found accuracy comparable to state-of-the-art tools, achieving a mean squared error below 10% on this workload on Intel's Skylake microarchitecture.
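The throughput prediction that a resource mapping enables can be sketched as follows. The resource names, their per-cycle capacities, and the instruction-to-resource mapping below are all invented for illustration; they are not taken from Palmed or from any real microarchitecture.

```python
# hypothetical abstract resources and their capacity (uses per cycle)
RESOURCES = {"alu_pair": 2.0, "mem_pair": 2.0, "frontend": 4.0}

# hypothetical resource mapping: each instruction consumes some
# amount of each abstract resource
MAPPING = {
    "add":  {"alu_pair": 1, "frontend": 1},
    "mul":  {"alu_pair": 1, "frontend": 1},
    "load": {"mem_pair": 1, "frontend": 1},
}

def cycles_per_iteration(block):
    # accumulate the total demand each instruction places on each resource
    use = {r: 0.0 for r in RESOURCES}
    for insn in block:
        for r, amount in MAPPING[insn].items():
            use[r] += amount
    # in steady state, the most contended resource is the bottleneck
    return max(use[r] / RESOURCES[r] for r in RESOURCES)
```

For example, a block of two arithmetic instructions and two loads saturates every resource evenly, while four arithmetic instructions bottleneck on the ALU resource and take twice as long per iteration.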
PALMED: Throughput Characterization for Superscalar Architectures - Extended Version
In a super-scalar architecture, the scheduler dynamically assigns micro-operations (µOPs) to execution ports. The port mapping of an architecture describes how an instruction decomposes into µOPs and lists for each µOP the set of ports it can be mapped to. It is used by compilers and performance debugging tools to characterize the performance throughput of a sequence of instructions repeatedly executed as the core component of a loop. This paper introduces a dual equivalent representation: The resource mapping of an architecture is an abstract model where, to be executed, an instruction must use a set of abstract resources, themselves representing combinations of execution ports. For a given architecture, finding a port mapping is an important but difficult problem. Building a resource mapping is a more tractable problem and provides a simpler and equivalent model. This paper describes Palmed, a tool that automatically builds a resource mapping for pipelined, super-scalar, out-of-order CPU architectures. Palmed does not require hardware performance counters, and relies solely on runtime measurements. We evaluate the pertinence of our dual representation for throughput modeling by extracting a representative set of basic blocks from the compiled binaries of the SPEC CPU 2017 benchmarks. We compared the throughput predicted by existing machine models to that produced by Palmed, and found accuracy comparable to state-of-the-art tools, achieving a mean squared error below 10% on this workload on Intel's Skylake microarchitecture.